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A SMOOTHED DUAL APPROACH FOR VARIATIONAL 
WASSERSTEIN PROBLEMS 

MARCO CUTURI AND GABRIEL PEYRE* 


Abstract. Variational problems that involve Wasserstein distances have been recently proposed 
to summarize and learn from probability measures. Despite being conceptually simple, such prob¬ 
lems are computationally challenging because they involve minimizing over quantities (Wasserstein 
distances) that are themselves hard to compute. We show that the dual formulation of Wasserstein 
variational problems introduced recently by [13] can be regularized using an entropic smoothing, 
which leads to smooth, differentiable, convex optimization problems that are simpler to implement 
and numerically more stable. We illustrate the versatility of this approach by applying it to the 
computation of Wasserstein barycenters and gradient flows of spacial regularization functionals. 


1. Introduction. To compare two histograms in the probability simplex, infor¬ 
mation divergences—the Hellinger and X 2 distances, the Kullback-Leibler and Jensen- 
Shannon divergences—have the advantages of being simple and fast to compute. Opti¬ 
mal transport distances [34, §7]— a.k.a. Wasserstein or earth mover’s distances [29] — 
require more computational effort but are more versatile: by incorporating in their 
definition a metric between the bins of these histograms, they can compare sparse 
histograms even if their support do not overlap significantly, which can be crucial 
when their dimension is large. Their versatility comes, however, at a price: comput¬ 
ing optimal transport distances requires solving a costly network flow problem, whose 
cost scales super-cubicly with the dimension of the considered histograms. That cost 
becomes even more of a drawback if one attempts to study a family of histograms 
using the optimal transport geometry. 

Despite this computational complexity, optimal transport is becoming increas¬ 
ingly popular in imaging sciences and related fields, such as for instance image re¬ 
trieval [29, 26], image interpolation [10], computational geometry [23, 21], color image 
processing [9, 35], image registration [22] and machine learning [15]. 

1.1. Variational Wasserstein Problems. Many learning tasks on histograms, 
such as averaging or clustering them, can be framed as variational problems that in¬ 
volve distances between pairs of histograms. These problems are easily solved when 
such divergences are Bregman divergences [19, 4, 24], but they are far more challeng¬ 
ing when considering instead Wasserstein distances. [2] studied the first problem of 
this type, the Wasserstein barycenter problem (WBP), and showed that it is related 
to the multi-marginal optimal transport problem. More recently, [32] proposed the 
Wasserstein propagation-on-graphs framework and showed that it involves a very large 
linear program, which can only be feasibly solved for small dimensions and families of 
histograms. Variational problems that involve Wasserstein distances have, however, 
the potential to impact a very wide range of applications. Beyond their applicability 
to unsupervised learning problems and their ramifications into clustering mentioned 
in [16], they have found usage in statistics to develop population estimators [8], com¬ 
puter graphics to perform image modification [35, 9, 31] and computer vision [36] to 
summarize complex visual signals. 

Beside the computation of barycenters, it is also possible to integrate Wasserstein 
distances into more general variational problems. For instance, optimal transport 
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distances are used as a data fidelity to perform image denoising [12, 20], image seg¬ 
mentation [28, 33, 30], and Radon transform reconstruction [1, 7]. 

Our aim in this paper is to propose a computational framework that is both 
scalable and flexible enough to minimize energies that involve not only Wasserstein 
distances, but also more general functions such as regularization terms. To do so, we 
exploit regularization, Legendre duality and the usual toolbox of convex optimization. 

1.2. Previous Works. [16] proposes to leverage the entropic regularization of 
Wasserstein distances introduced by [15] to study the WBP. Their formulation re¬ 
quires, however, to run a numerical subroutine, the Sinkhorn fixed-point iteration, to 
evaluate these objectives and compute their gradients. On the other hand, [13] show 
that the Fenchel-Legendre dual of the Wasserstein distance as well as its subgradients 
can be obtained in closed form using nearest-neighbor assignments, that is without 
having to solve a single optimal transport problem. The authors do, however, strug¬ 
gle with non-differentiable objective functions and use a L-BFGS first order scheme. 
More recently, [7] have proposed an a generalized version of Sinkhorn’s algorithm to 
compute barycenters based on Bregman’s projections. This approach is useful for the 
barycenter problem, but cannot be easily adapted to solve more advanced variational 
problems. 

A typical use of such more involved variational problems is the approximation of 
gradient flows. As initially shown by [18], it is indeed possible to approximate solu¬ 
tions of a large family of partial differential equations by iteratively minimizing some 
energy functional plus the Wasserstein distance to the previous iterates. We refer to 
Section 4.6 for more details and references about these schemes. Following the method 
introduced in [27], one can approximate these iterations using entropic regularization. 
Our dual approach is crucial to be able to tackle non-separable energies, such as for 
instance the total variation of images. 

Another illustration of the usefulness of our dual approach is the application to 
image segmentation developed in [28]. Note that this application requires to compute 
the gradient of the dual of the smoothed Wasserstein distance with respect to two 
histograms. This formula is provided in Appendix A. 

1.3. Contributions. Our main contribution is to combine the strengths of the 
dual formulation of [13] with the smoothing strategy laid out by [16] to obtain a smooth 
optimization problem whose objectives and derivatives can be computed in closed 
form in §2. We show that this approach can be readily used to compute Wasserstein 
barycenters in §3, and explain why using regularized Wasserstein distances might be 
beneficial to recover smooth solutions. We proceed with more general energies that 
involve not only Wasserstein distances, but also more generally spatial regularization 
of barycenters and gradient flows, in §4. 

The source code to reproduce the numerical illustration of this article can be 
found online^. 

1.4. Notations. When used on matrices, functions such as log or exp are always 

applied element-wise. For two matrices (or vectors) A, B of the same size, AoB (resp. 
A/B) stands for the element-wise product (resp. division) of A by B. If is a vector, 
diag(i4) is the diagonal matrix with diagonal u. is the (column) vector of 

ones. 
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2. Legendre Transforms of the Smoothed Wasserstein Distance. We 

introduce in this section the entropic regularization of the Wasserstein distance, study 
its Legendre transform and show that it admits a simple closed form. 

2.1. Optimal Transport with Entropic Smoothing. We consider two dis¬ 
crete probability distributions on the same space, represented through their his¬ 
tograms p, g G Hn of ^ values. We also introduce a symmetric cost matrix M = 
G Each element Mij accounts for the (ground) cost of moving 

mass from bin i to bin j. In many applications of optimal transport, the cost ma¬ 
trix M is defined through n points {xi)i taken in a metric space such that 

Mij = D{xi^XjY^ p ^ 1. Note however that we make no assumption on M in this 
paper other than the fact that it is symmetric and non-negative. 

Given p, g, the set of couplings U (p, g) and the discrete entropy of any coupling 
in that set are defined as, 

U{p, q) {X e Ry ” ; Xln = P, X^ln = q} , E{X) ^ h{Xij), ( 2 .I) 


where \/x > 0, h(x) =^xlogx, h(0) = 0. We follow [15] ’s approach and introduce a 
entropy-regularized optimal transport problem: 

W^(p, q) mm (M, X) - -fE{X), (2.2) 

XeUip,q) 

where 7 ^ 0 . For 7 = 0, one recovers the usual optimal transport problem, which is a 
linear program. Wo is known as the Wasserstein distance (or Earth Mover’s Distance, 
EMD) between p and g. For 7 > 0, Problem (2.2) is strongly convex and admits 
a unique optimal coupling X*. [15] called the resulting cost (M, X^) the Sinkhorn 
divergence between p and g. 

While X* is not necessarily unique for 7 = 0, we show in the following proposition 
that in the small 7 limit, the regularization captures the maximally entropic coupling. 

Proposition 2.1. One has Wq as j ^ 0, and denoting X* the unique 

solution of (2.2), one has 

X; Xo" = argmax {E{X) ; (M, X) = Wo(p, g)} . (2.3) 

xeu(p,q) 


Proof We consider a sequence ( 7 ^)^ such that 7 ^ ^ 0 and 7 ^ > 0. We denote 
X£ = Xf ^. Since U (p, g) is bounded, we can extract a sequence (that we do not relabel 
for sake of simplicity) such that X^ ^ X*. Since t/(p, g) is closed, X* G U{p^q). We 
consider any X such that (M, X) = Wo(p, g). By optimality of X and Xi for their 
respective optimization problems (for 7 = 0 and 7 = 7 ^), one has 

0 ^ (M, X,) - (M, X) ^ ji{E{X£) - E{X)). (2.4) 

Since E is continuous, taking the limit i -hoc in this expression shows that 
(M, X*) = (M, X) so that X* is a feasible point of (2.3). Furthermore, dividing 
by 7 ^ in (2.4) and taking the limit shows that E{X) ^ F(X*), which shows that 
X* is a solution of the maximization (2.3). Since the solution Xq to this program 
is unique by strict convexity of — F, one has X* = Xq, and the whole sequence is 
converging. □ 
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[16] provided a dual expression for Wj. The proof of that result follows from an 
application of Fenchel-Rockafellar duality to the primal problem (2.2). The indicator 
function of a closed convex set C is ic{x) = 0 for x G C and ic{x) = Too otherwise. 
Proposition 2.2. One has 

W^ip, q) = max {u, p) + {v, q) - B{u, v), (2.5) 


B{u,v) =■ 


7Ei,j exp(yui + Vj - Mij) - 1), i/7 > 0; 

(•Cm */ 7 = 0, where Cm =' {{u, v) ■, Ui + Vj < Mij} . 


When 7 > 0, this regularization results in a smoothed approximation of the 
Wasserstein distance with respect to either of its arguments, as shown below. To 
simplify notations, let us introduce the notation Hq{p), the Wasserstein distance of 
any point p to a fixed histogram q G 

^pej:n, =Vep,</). 

Note that Hq is a convex function for all 7 ^ 0. When 7 > 0, Hq has the following 
properties, which follow from the direct differentiation of expression (2.5): 

Proposition 2.3. For 7 > 0 and (p, q) ^ with p > 0^ q > 0, Hq is at 

p and VHq{p) = where is the unique solution of (2.5) satisfying {iF^ 1^) = 0. 

Computing both Hq and its gradient requires thus the resolution of the optimiza¬ 
tion problem in Eq. (2.5), which can be solved with a Sinkhorn fixed-point itera¬ 
tion [15] as remarked by [16, §5]. This computation can be avoided when studying 
the Fenchel-Legendre conjugate of Hq^ as shown below. 

2.2. Legendre Transform with Respect to One Histogram. The goal of 
this section is to show that the Fenchel-Legendre transform of 

H;{g) = m^{g,p)-H,{p), 

has a closed form. This result was already known when 7 = 0, that is for the original 
Wasserstein distance. [13, Prop. 4.1] showed indeed that computing iL* only requires 
a sequence of nearest-neighbor assignments. We show that for 7 > 0, these nearest- 
neighbor assignments are replaced by soft assignments. 

Compared to the primal smoothed Wasserstein distance the computation of 
both Hq and its derivatives can be carried out without having to solve a matrix¬ 
scaling problem. These properties are at the core of the computational framework we 
develop in this paper. 

Theorem 2.4 (Legendre Transform of Hq). For 7 > 0, the Fenehel-Legendre 
dual funetion H* is . Its gradient funetion \/Hq{') 251/7 Lipsehitz. Its value, 
gradient and Hessian at g ^ are, writing a = e^^^ and K = e~^l^, 

H;{g) = 7 {E{q) + {q, log Ka)), VH;{g) = « o (if G I]„, 

(g) = 1 diag (a o K ^) - 1 dmg{a)K diag (^ diag(a). 
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Proof. Writing Hq^M{p) in place of Hq{p) to make explicit the dependency on M, 
one has 

= max {g, p) - max {u, p) + {v, q) - B{u,v) 

pe^n U,V 

= max — max {u', p) + {v, q) — B{u' + g, v) 

pe'^n u',V 

= max - Hg^M-gi^ip) = - min min (M - gl'^, X) - 'fE{X). 

pEi^n pEi^n X^U(^p,q) 

This leads to an optimal transport problem which is only constrained by one marginal, 
Km{9) = - ^ min (M - gl^, X) - ^E{X) (2.7) 

which can be explicitly solved by writing first order conditions for (2.7) to obtain 
that, at the optimum, we necessarily have log(X* ) = ^{pi — Mij — pj) — 1 for some 

vector of values p G Therefore X* has the form X'^ = diag(a)i^diag(e^/'^“^), 
using the notation a = . Because of the marginal constraint that 

the rightmost diagonal matrix must necessarily be equal to dmg{q/Ka)^ and thus 
X'^ = dmg{a)K dmg{q/Ka). Therefore, the Legendre transform has a closed 

form, 

Km{ 9) = -{M - gt^, X*) + ^E{X*) (2.8) 

which can be simplified to 

KMi9) = -7lJ {{Ka) o h{q/Ka)). 

by using the fact that X* = didig{a)K didig{q/Ka). This equation can be simplified 
further to obtain the expression provided in Eq. (2.6). Using Eq. (2.8), we have that 

VHlM=X*l=ao[K^). 

Computations for the Hessian follow directly, and result in the expression given 
Eq. (2.6). Since the Hessian can be written as the difference of two positive defi¬ 
nite matrices, one diagonal and the other equal to the product of a matrix times its 
transpose, the trace of is upper bounded by the trace of the first term, which 

is equal to ^ (recall that ViL*^(^) is in the simplex), which proves the ^-Lipschitz 
continuity of the gradient of H^. □ 

In some settings, such as the Wasserstein propagation framework of [32], the aim 
is to minimize Wasserstein distances with respect to two variable arguments. We 
provide the formulation for the corresponding Legendre transform in Theorem A.l in 
the Appendix. 

2.3. Un-regularized Case. The result of Theorem 2.4 is derived in the un¬ 
regularized case {i.e. 7 = 0) in [13]. Eor the sake of comparison, let us now recall 
this result using our notations. Given a cost matrix M G and a vector g G 

we introduce for i ^ n the set NM,g{i) = argmin^. Mik — Pi- In other words, NM,g{i) 
is the set of nearest-neighbors of i with respect to the vector of distances Mik offset 
by -Pi^ 

A map o-M,g • ^ is called a nearest-neighbor map if the vector 

o'M,g{'^) only has non-zero values on indices in NM,g{'i), namely 

[^M,g{i)]j 7 ^ 0 j G NM,g{i)^ 
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If NM,g{i) is a singleton {j} (the minimization min/. Mik — gi admits only one optimal 
solution) then is necessarily equal to a Dirac histogram 5j (we call a Dirac 

histogram a histogram with mass 1 on only one coordinate, of index j in this case). 
When NM,g{i) has more than one element, ties have to be taken care off, and this 
can be carried out arbitrarily, for instance by dividing the mass equally among those 
nearest neighbors, or by only choosing arbitrarily one of them. We can now recall the 
result of [13]: 

Proposition 2.5 (Carlier et al. 2014, Prop. 4.1). For 7 = 0 and a nearest- 
neighbor map cTM.g, the Fenehel-Legendre dual funetion admits the following veetor 
in its sub-differential dH*{g) at g ^ 

i^d 


Note that Sq{g) is in The value of Hq{g) is {Sq{g), g). 

3. Smooth Dual Algorithms For the Wasserstein Barycenter Problem. 

In this section, we use the properties of the Legendre transform of the Wasserstein 
distance as detailed in Section §2 to solve the Wasserstein Barycenter Problem. 

3.1. Smooth Dual Formulation of the WBP. Following the introduction 
of the Wasserstein Barycenter Problem (WBP) by [2], [16] introduced the smoothed 
WBP with 7 -entropic regularization ( 7 -sWBP) as 

N 

min . (3.1) 

where (gi,..., ^at) is a family of histograms in When 7 = 0, the 7 -sWBP is exactly 
the WBP. In that case, problem (3.1) is in fact a linear program, as discussed later in 
§3.4. When 7 > 0 the 7 -sWBP is a strietly convex optimization problem that admits 
a unique solution, which can be solved with a simple gradient descent as advocated 
by [16]. They show that the N gradients [ViL^^(p)]^^^ can be computed at each 
iteration by solving N Sinkhorn matrix-scaling problems. Because these gradients are 
themselves the result of a numerical optimization procedure, the problem of choosing 
an adequate threshold to obtain sufficiently precise gradients arises as a key parameter 
in that approach. We take here a different route to solve the 7 -sWBP, which can be 
either interpreted as a smooth alternative to the dual WBP studied by [13], or the 
dual counterpart to the smoothed WBP of [16]. 

Theorem 3.1. The baryeenter solving (3.1) satisfies 

Vk = l,...,N, p* = VHl{gl) (3.2) 

where {gX)k ^^2/ solution of the smoothed dual WBP: 

» T^kH* {gk) s.t y'Afcgfc=0. (3.3) 

k k 

Proof We re-write the barycenter problem 

min W AfciJq^ (pk) s.t. pi = ... = p^ 

Pl,---,PN ^^ 
k 
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whose Fenchel-Rockafelar dual reads 


_min VAfciJ* Ofc/Afe) s.t. y^g^ = 0. 
k k 

Since the primal problem is strictly convex, the primal-dual relationships show that 
the unique solution of the primal can be obtained from any solution via the 

relation p'^ = (^^/A/c). One obtains the desired formulation using the change of 

variable gk=gkl^k‘ □ 

Theorem 3.1 provides a simple approach to solve the y-sWBP: Rather than mini¬ 
mizing directly the sum of regularized Wasserstein distances in Eq. (3.1), this formu¬ 
lation only involves minimizing a strictly convex function with closed form objectives 
and gradients. 

Parallel Implementation. The objectives, gradients and Hessians of the Fenchel- 
Legendre dual can be computed using either matrix-vector products or element¬ 
wise operations. Given N histograms {qk)ki N dual variables {gk)k and N arbitrary 
vectors {xk)ki the computation of N objective values {Hq^{gk))k and N gradients 
{VHq^{gk))k can all be vectorized. Assuming that all column vectors gk^ Qk and Xk 
are gathered in n x A" matrices G, Q and X respectively, we define first the following 
n X N auxiliary matrices: 

^drfgG/7^ c=^^, A'^=Ao{KC), 

to form the vector of objectives 

H* ( 5 i),..., ( 5 ^)] = {Q o log(C)), 

and the matrix of gradients 

VH* (5 i),..., (5^)] = A. (3.4) 

3.2. Algorithm. The y-sWBP in Eq. (3.3) has a smooth objective with respect 

to each of its variables g^, a simple linear equality constraint, and both gradients 
and Hessians that can computed in closed form. We can thus compute a minimizer 
for that problem using a naive gradient descent outlined in Algorithm 1. To obtain a 
faster convergence, it is also possible to use accelerated gradient descent, quasi-Newton 
or truncated Newton methods [11, §10]. In the latter case, the resulting KKT linear 
system is sparse, and solving it with preconjugate gradient techniques can be efficiently 
carried out. We omit these details and only report results using off-the-shelf L-BFGS. 
From the dual iterates g^ stored in a n x A matrix G, one recovers primal iterates 
using the formula (3.2), namely pk = o At each intermediary iteration 

one can thus form a solution to the smoothed Wasserstein barycenter problem by 
averaging these primal solutions,p = Al^v/A. Upon convergence, these pk are all 
equal to the unique solution p^. The average at each iteration p converges towards 
that unique solution, and we use the sum of all line wise standard deviations of A: 

lJy^(A o A)l 7 v/A where A = to monitor that convergence in our 

algorithms. 

3.3. Initialization Heuristic. Definition 3.2 provides an initialization heuristic 
to initialize both the primal and dual smoothed WBP, motivated by the fact that they 
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Algorithm 1 Smoothed Wasserstein Barycenter, Generic Algorithm 

1 : Input: Q = [gi, • • • , q^] G metric M G barycenter weights A G Sat, 

7 > 0 , tolerance 5 > 0 . 

2 : initialize G G R^^^ and form the n x n matrix K = . 

3: repeat 

4: From gradient matrix A (see Eq. 3.4) produce update matrix A using either A 

directly or other methods such as L-BFGS. 

5: G = G — rA, update with fixed step length r or approximate line search to set 


6 : G = G — p^(GA)A^ (projection such that GX = 0 ) 

7: until lJy^(A o A)lAr/A^ < where A = A{In — 

8 : output barycenter p = Aljsf/N. 

provide directly the optimal primal/dual solutions when the histograms are Dirac 
histograms as proved in Proposition 3.3. 

Definition 3.2 (Primal and Dual WBP Initialization). Let (gi,*** ^Qn) be 
N target histograms in the simplex and A a veetor of weights in Let q = 

e ^n- Define as 




7 ~ 


where j G argmin^[Mg]^, if j = 0. 


For 7 ^ 0 ; the smoothed primal and dual WBP ean be initialized respeetively with 
the following primal and N dual feasible solutions: 




' 7 : 


(3.5) 


and for 1 ^ k ^ N, 


gP M{qk - q). 


(3.6) 


The primal initialization described above differs when 7 > 0 or 7 = 0: For 
7 > 0 , is the normalized, weighted geometric average of the columns of the kernel 
K = e~^l^\ when 7 = 0, is a vector of zero values except for a value of 1 on the 
index corresponding to the (or any, if many) smallest entry of Mg. On the other hand, 
the dual initialization is the same for both smoothed and non-smoothed Wasserstein 
barycenter problems. 

The initializations proposed in Definition 3.2 solve the WBP in the case that 
all histograms are Dirac histograms, as proved in Proposition 3.3. For more general 
problems, we have observed that this initialization is particularly useful when solving 
the WBP with the dual formulation, but not so much with the primal one. In many 
experimental problems we have considered, the dual initialization seems to capture 
important features of the optimal solution. The primal solution that results from this 
dual initialization, that obtained by averaging the gradients VM*^(^^) as suggested 
by the primal/dual relation of Equation (3.1), can serve as a rough approximation of 
the barycenter. We provide its explicit expression below. Note that differs 
from the initialization p^^^ suggested in Equation (3.5). 


( 0 ) 

nL\ = 




K 


Q 




8 








Proposition 3.3. Let X be a veetor of weights in and (gi,*** ^qn) be N 
Dime histograms, namely histograms that are zero everywhere but for one eoordinate 
equal to 1. For 7^0; the ^-sWBP primal and dual problems are solved exaetly using 
the initialization of Definition (3.2). 

Proof To simplify notations, we write p = p^^^ and g^ = g^^ as defined in 
Definition 3.2 above. First, one can easily check that both initialization satisfy the 
necessary constraints, z.e. p G and Xkgk = 0- 

When 7 = 0, since all are Dirac histograms, the Wasserstein distance of any 
point X in the simplex to any qk is equal to x^Mq^. Therefore, the Wasserstein 
barycenter objective evaluated at x is equal to x^Mq. This can be trivially minimized 
by selecting any histogram giving a mass of 1 to the index corresponding to any 
smallest entry in the vector Mg, which is the definition of p. A similar computation 
for the dual problem results in the dual optimal outlined above. 

When 7 > 0, we need to prove that each gradient of computed at g^ is equal 
to p for all 1 ^ k ^ N. Writing we recover that 


where , Since q^ is a Dirac histogram, all of its coordinates are equal 

to 0, but for one coordinate whose value is 1. Let f be the index of that coordinate. 
Therefore, fk — Xj, where Kj is the column of the matrix K = e~^l^. 

Therefore, 


Let us now compute the gradient V/c of at gk by following Eq. (2.6): 

Because of the symmetry of K, we have that the element of the vector Kak is 
equal to: 


{Kak)j = Kjak = {K, o a^) = 




= 1 . 


Since only the element of qk is non-zero by definition, qk/{Kak) = qk- Because qk 
is everywhere zero except for its coordinate, K{qk/Kak) is thus equal to the 
column of K, namely 


K-^=Kj. 

Kak 


Finally, we obtain that the gradient of iL* at gk is equal to 


Vk = ak o [ K 


Qk 


Kotk J 


A _ 


Kj 


o Kj = Kj = p ^^^, 


which holds for all indices 1 ^ k ^ N. □ 
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3.4. Smoothing and Stabilization of the WBP. We make the claim in this 
section that smoothing the WBP is not only beneficial computationally, but may also 
yield more stable computations. Of central importance in this discussion is the fact 
that the WBP can be cast as a LP of Nn^ + n variables and 2Nn constraints, and 
thus solved exactly for small n and N: 

N 

^ “ig y'Afe(Xfe, M) 

^^N,p f—: 
k=l 

s.t. Xfe G G Sn, (3-7) 

Xfe = qk,yk < N, 

= • • • = ~ P' 

Given couplings which are optimal solutions to Eq.(3.7), the solution to 

the WBP is equal to the marginal common to all those couplings: p'^ = X^ln for 
any k ^ N. For small N and n, this problem is tractable, but it can be surprisingly 
ill-posed as we see next. 

Indeed, it is also known that the 2 -Wasserstein mean of two univariate (continu¬ 
ous) Gaussian densities of mean and standard deviation (/ii,cri) and { 1121 (^ 2 ) respec¬ 
tively is a Gaussian of mean (/ii + /i 2)/2 and standard deviation (cri +cr 2)/2 [ 2 , §6.3]. 
This fact is illustrated in the top-left plot of Figure 3.1 where we display the average 
Wasserstein average A7(0,5/8) of the two densities A/’( 2 , 1 ) and A7(— 2 ,1/4). That 
plot is obtained by using smoothed spline interpolations of a uniformly spaced grid of 
100 values, as can be better observed in the top-right (stair) plot, where the discrete 
evaluations of these densities are respectively denoted pw^ Qi and g 2 - 

Naturally, one would expect the barycenter of qi and q 2 to be close, in some sense, 
to the discretized histogram pw of their true barycenter. Histogram p*, displayed 
in the bottom-left plot, is the exact optimal solution of Eq. (3.7), computed with 
the simplex method. That WBP reduces to a linear program of 2 x 100^ variables 
and 300 constraints. We observe that W^ip'^.qi) + ,q 2 ) = 0.5833950 whereas 

^2 + W 2 {pw^Q 2 ) = 0.5834070. The solution obtained with the simplex has, 

indeed, a smaller objective than the discretized version of the true barycenter. 

The bottom-right plot displays the solution of the smoothed Wasserstein barycen¬ 
ter problem (with smoothing parameter 7 = and a ground cost M that has been 
re-scaled to have a median value of 1). The objective value for that smoothed ap¬ 
proximation is 0.5834597. 

This numerical experiment does not contradict the fact that the discretized barycen¬ 
ter p* converges to the continuous barycenter as the grid size tends to zero, as shown 
in [13]. This observation illustrates however that, because it is defined as the argmin 
of a linear program, the true Wasserstein barycenter may be extremely unstable, even 
for such a simple problem and for large n as illustrated in Figure 3.2. Regularizing the 
Wasserstein distances has thus the added benefit of smoothing the resulting solution 
of the Wasserstein barycenter problem, and that of mitigating low sample size effects. 

3.5. Performance on the Wasserstein Barycenter Problem. We compare 
in this section the behavior of the smooth dual approach presented in this paper with 
that of (i) the smooth primal approach of [16], (ii) the dual approach of [13], and 
(hi) the Bregman iterative projections approach of [7]. We compare these methods 
on the simple task of computing the Wasserstein barycenter of 12 histograms laid out 
on the 100 x 100 grid, as previously introduced in [7, §3.2]. We outline briefly all four 
methods below, and follow by presenting numerical results. 
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Fig. 3.1: (top-left) two Gaussian densities and their barycenter (top right) same den¬ 
sities, discretized (bottom left) discretization of the true barycenter vs. the optimum 
of Equation 3.7 (bottom right) barycenter computed with our smoothing approach. 


n = 200 



Fig. 3.2: Plots of the exact barycenters for varying grid size n. 


Smooth primal first-order descent. [16, §5] proposed to minimize directly Eq. (3.1) 
with a regularizer 7 > 0. That objective can be evaluated by running N Sinkhorn 
fixed-point iterations in parallel. That objective is differentiable and its gradient 
is equal to 7 A/^ log (a/., where the are the left scalings obtained with that 

subroutine. A weakness of that approach is that a tolerance e for the Sinkhorn fixed- 
point algorithm must be chosen. Convergence for the Sinkhorn algorithm can be 
measured with a difference in li norm (or any other norm) between the row and 
column marginals of diag(a/c)e“^/'^ diag(/3/e) and the targeted histograms p and g/^. 
Setting that tolerance 5 to a large value ensures a faster convergence of the subroutine, 
but would result in noisy gradients which could slow the convergence of the algorithm. 
Because the smoothed dual approach only relies on closed form expressions we dot 
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Fig. 3.3: Number of quadratic operations (matrix vector product or min search in a 
matrix) vs. optimization gap to the smallest possible objective found after running 10^ 
iterations of all algorithms, log-log scale. Because the smooth primal/dual approaches 
optimize a different criterion than the dual approach, we plot both objectives. The 
Smooth dual L-BFGS converges faster in both smooth and non-smooth metrics. Note 
the crucial importance of the initialization proposed in §3.3. 

not have to take into account such a trade-off. 

Iterative Bregman Projeetions. [7, Prop. 1] recalls that the computation of the 
smoothed Wasserstein distance between p, q using the Sinkhorn algorithm can be 
interpreted as an iterative alternated projection of the n x n kernel matrix onto 

two affine sets, {X : Xl^ = p} and {X : = q}. That projection is understood 

to be in the Kullack-Leibler divergence sense. More interestingly, the authors also 
show that the smoothed WBP itself can also be tackled using an iterative alternated 
projection, cast this time in a space of dimension n x n x N. Very much like the 
original Sinkhorn algorithm, these projections can be computed for a cheap price, 
by only tracking variables of size n x N. This approach yields an extremely simple, 
parameter-free, generalization of the Sinkhorn algorithm which can be applied to the 
WBP. 

Smooth dual L-BFGS. The dual formulation with variables (pi, • • • , Pat) ^ 
of Eq. (3.3) can be solved using a constrained L-BFGS solver At each iteration of that 
minimization, we can recover a feasible solution p to the primal problem of Eq. (3.1) 
via the primal-dual relation p = ^ '^^qk(dk)‘ 

Dual (7 = 0) with L-BFGS. This approach amounts to solving directly the (non- 
differentiable) dual problem described in Eq. (3.3) with no regularization, namely 
7 = 0. Subgradients for the Eenchel-Legendre transforms can be obtained in 
closed form through Proposition 2.5. As with the smoothed-dual formulation, we can 
also obtain a feasible primal solution by averaging subgradients. We follow [13] ’s rec- 
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ommendation to use L-BFGS. The non-smoothness of that energy is challenging: We 
have observed empirically that a naive subgradient method applied to that problem 
fails to converge in all examples we have considered, whereas the L-BFGS approach 
converges, albeit without guarantees. 

Averaging Truncated Mixtures of Gaussians. We consider the 12 truncated mix¬ 
tures of Gaussians introduced in [7, §3.2]. To compare computational time, we use 
Nn^ elementary operations as the computation unit. These Nn^ operations corre¬ 
spond to matrix-matrix products in the smoothed Wasserstein case, and Nn com¬ 
putations of nearest neighbor assignments among n possible neighbors. Note that 
in both cases (Gaussian matrix product and nearest neighbors under the L 2 metric) 
computations can be accelerated by considering fast Gaussian convolutions and kd- 
trees for fast nearest neighbor search. We do not consider them in this section. We 
plot the optimality gap w.r.t the optimum of these 4 techniques as a function of the 
number of computations, by taking as a reference the lowest value attained across all 
methods. This value is attained, as in [7], by the iterative Bregman projections ap¬ 
proach after 771 iterations (not displayed in our graph). We show these gaps for both 
the smoothed (7 = 1/100) and non-smoothed objectives (7 = 0). We observe that 
the iterative Bregman approach outperforms all other techniques. The smoothed- 
dual approach follows closely, notably when initialized with the formula provided in 
Definition 3.2. 

4. Regularized Problems. We show in this section that our dual optimization 
framework is versatile enough to deal with functionals involving Wasserstein distances 
that are more general than the initial WBP problem. 

4.1. Regularized Wasserstein Barycenters. In order to enforce additional 
properties of the bary centers, it is possible to penalize (3.1) with an additional convex 
regularization, and consider 


N 


Y,WkHg^{p) + J{Ap), 


(4.1) 


mm 


k=l 


where J is a convex real-valued function, and ^ is a linear operator. 

The following proposition shows how to compute such a regularized barycenter 
through a dual optimization problem. 

Proposition 1. The dual problem to (4.1) reads 


N 



(4.2) 


where H = 



and the primal-dual relationships read 


Vfc = l,...,7V, p = VHl(uk). 


(4.3) 


Proof. We re-write the initial program (4.1) as 


min F{B7r) + G( 7 r) 


(4.4) 


TT 
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where we denoted, for tt = (p,pi,...,pw), 

Btt =■ {Ap,p,Pi,...,pn) 
F{/3,q,pi,...,pN) =' J{I3) + Lc{q,Pi,---,PN), 
G{p, Pi,..., pjv) =■ ^ WkHg^ (pfe) 

k 


for C . ^Pn) ; V/c,j 9 /c = g'}. The Fenchel-Rockafelear dual to (4.4) reads 

max - F%u) 

U = {v,U,{Uk)k} 


where 


G*{u,ui,.. .,un) = '^WkH*^{uk/wk) + (•{o}(w), 

k 

B*{v) = {A*V + M, Ml, . . . , Mjv), 

F*{v) = J*{v) + ic±{u,Ui, . . . ,Mjv), 


where C-*- = {(m, Mi, ...,Mjv) ; u + Uk = 0}. One thus obtains the dual 


min WkH*A-Uk/wk) s.t. 

v,byuk)k j 


f A*v -\-u = 0, 

1 u + ^j^Uk = 0. 


The primal-dual relationships reads tt G 9G*(—and hence (4.3). Changing 
—Uk/wk into Uk give the desired formula. □ 

Relevant examples of penalizations J include: 

• In order to enforce some spread of the barycenter, one can use ^ = Id and 

J{p) = IIIpIP: which case J*(^) = In contrast to (3.3), the dual 

problem (4.2) is equivalent to an unconstraint smooth optimization. This 
problem can be solved using a simple Newton descent. 

• One can also enforce that the barycenter entries are smaller than some maxi¬ 
mum value p by setting ^ = Id and J = lq where C = {p ; ||p||oo ^ p}- In this 
case, one has J*(p) = p||p||i- The optimization (4.2) is equivalent an uncon¬ 
strained non-smooth optimization. Since the penalization is an A norm, one 
solve it using first order proximal methods as detailed in Section 4.2 bellow. 

• To force the barycenter to assume some fixed values pj G on a given set 
/ of indices, one can use ^ = Id and J = lc where C = {p ; p/ = p^} where 
Pi = {Pi)iei- One then has J*{g) = {gj, p°j) + (.{o}(5/0- 

• To force the barycenter to have some smoothness, one can select A to be 
a spacial derivative operator (for instance a gradient approximated on some 
grid or mesh) and J to be a norm such as an 1“^ norm (to ensure uniform 
smoothness) or an A norm (to ensure piecewise regularity). We explore this 
idea in Section 4.3. 


4.2. Resolution using First Order Proximal Splitting. Assuming without 
loss of generality that re at 7 ^ 0 (otherwise one simply needs to permute the ordering 
of the input densities), one can note that it is possible to remove un from (4.2) by 
imposing, for x = ((«*;)£"/, f) 


un{x) = - 


A*v 

WN 


JV-1 


E 


Wk 

- Uk, 

WN 
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and then one can consider the following optimization problem without the H con¬ 
straint 


min F{x) -1- G{x) where 


F{x) = Y.k=l {um{x)) 

G{x) "= J*{v). 


We assume that one is able to compute the proximal operator of J* 
ProxT-j*(i;) argmin ^\\v — v'\f 

v' ^ 


(4.5) 


(4.6) 


It is for instance an orthogonal projector on a convex set C when J* = lc is the 
indicator of C. One can compute easily this projection for instance when J is the 
or the norm (see Section 4.3). We refer to [5] for more background on proximal 
operators. 

The proximal operator of G is then simply 

Va; = ((wfe)£'itv), Prox^G(a:) = ((Mfc)^^/, Prox^j.(v)). 


Note also that the function F is smooth with a Lipschitz gradient, and that 

^n{uk)^=i\v) = {{wk{VHl{uk)-yHl^{uN)))^-^\-AyHl^{uN)) 

The simplest algorithm to solve (4.5) is the Forward-Backward algorithm, whose 
iteration read 

^(^+1) _ Prox^-j* — rVF(x^^^)^ . (4.7) 

It r < 2/L where L is the Lipschitz constant of VF, then converge to a solution 
of (4.5), see [5] and the references therein. In order to accelerate the convergence of 
the method, one can use accelerated schemes such as FISTA’s algorithm [6]. 

4.3. Total Variation Regularization. A typical example of regularization to 
enforce some geometrical regularity in the barycenter is the total variation regular¬ 
ization on a grid in (e.g. d = 2 for images). It is obtained by considering 


Ap =■ Vp = (Vip)i and J{u) =' ||Mi||/ 3 , (4.8) 

i 


where G is a finite difference approximation of the gradient at a point indexed 
by i, and A ^ 0 is the regularization strength. When using the norm to measure 
the gradient amplitude, i.e. /d = 2, one obtains the so-called isotropic total variation, 
that tends to round corners, and essentially penalizes the length of the level sets of 
the barycenter, possibly merging clusters together. When using instead the norm, 
i.e. /d = 1, one obtains the so-called anisotropic total variation, which penalizes 
independently horizontal and vertical derivative, thus favoring the emergence of axis- 
aligned edges, and giving a “crystalline” look to the barycenters. We refer for instance 
to [14] for a study of the effect of TV regularization on the shapes of levelsets using 
isotropic and crystalline total variations. 

In this case, it is possible to compute in closed form the proximal operator (4.6). 
Indeed, one has J* = ^||.||^=,^a where /3* is the conjugate exponent 1//3 + 1//3* = 1. 
One can compute explicitly the proximal operator in the case [3 G {1,2} since they 
correspond to orthogonal projectors on balls 


Prox^j* 



min(max(i;^, — A), A) if /3 = 1, 
II 1^ if 13 = 2. 


15 



4.4. Barycenters of Images. We start by computing barycenters of a small 

number of 2-D images, that are discretized on an uniform rectangular grid of n = 
256 X 256 pixels We use either the isotropic (/3 = 2) or anisotropic {P = 1) 

total variation presented above, where V is defined using standard forward finite 
differences along each axis, and using Neumann boundary conditions. The metric is 
the usual squared Euclidean metric 

V(i,i) G = \zi - Zjf . (4.9) 

The Gibbs kernel K = is a filtering with a Gaussian kernel, that can be 

applied efficiently to histograms in nearly linear time, see [31] for more details about 
convolutional kernels. 

Figure 4.1 shows examples of barycenters of = 4 input histograms computed by 
solving (4.1) using the projected gradient descent method (4.7). The input histograms 
represent 2-D shapes, and are uniform (constant) distributions inside the support of 
the shapes. Note that in general the barycenters are not shapes, i.e. they are not 
uniform distributions, but this method can nevertheless be used to define meaningful 
averaging of shapes as exposed in [31]. Figure 4.1 compares the effects of /3 G {1,2}, 
and one can clearly see how the isotropic total variation (^ = 2) rounds the corners 
of the input densities, while the anisotropic version (/3 = 1) favors horizontal/vertical 
edges. 

Figure 4.2 shows the influence of the regularization strength A to compute the 
iso-barycenter oi N = 2 shapes. This highlights the fact that this total variation 
regularization has the tendency to group together small clusters, which might be 
beneficial for some applications, as illustrated in Section 4.5 on MEG data denoising. 

4.5. Barycenters of MEG Data. We applied our method to a magnetoen¬ 
cephalography (MEG) dataset. In this setup, brain activity of a subject is recorded 
(Elekta Neuromag, 306 sensors of which 204 planar gradiometers and 102 magne¬ 
tometers, sampling frequency lOOOHz) while the subject reacted to the presentation 
of a target stimulus by pressing either the left or the right button. 

Data is preprocessed applying signal space separation correction, interpolation of 
noisy sensors, and realignment of data into a subject-specific head position (MaxFil- 
ter, Elekta Neuromag). The signal was then filtered (low pass 40HZ), and artifacts 
such as blinks and heartbeats removed thanks to Signal-Space Projection using the 
Brainstorm software^. The samples we used for our barycenter computations are an 
average of the norm of the two gradiometers for each channel from stimulation onto 
50ms and the classes were left or right button. 

This results in two classes of recordings, one for each pressed button. We aim at 
computing a representative activity map for each class using Wasserstein barycenters. 
For each class we have N = 33 recordings {qk)k=i each having n = 66 samples located 
on the vertices of an hexahedral mesh of a hemisphere (corresponding to a MEG 
recording helmet). These recorded values are positive by construction, and we rescale 
them linearly to impose Qk G Figure 4.3, top row, shows some samples from 
this dataset, displayed using interpolated colors as well as iso-level curves. The black 
dots represent the position (^i)f=i of the electrodes on the half-sphere of the helmet, 
flattened on a 2-D disk. 

We computed TV-regularized barycenters independently for each class by solv¬ 
ing (4.1) with the TV regularization using the projected gradient descent method (4.7). 

^http://neuroimage.use.edu/brainstorm 
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(a) A = 0 

(b) Isotropic, A = 100 













•*■4"*^'*’* 




(c) Anisotropic A = 500 (d) Anisotropic A = 2000 


Fig. 4.1: Examples of isotropic and anisotropic TV regularization for the computa¬ 
tion of barycenters between four input densities. The weights are bilinear 

interpolation weights, so that it is for instance w = (1,0,0,0) on the top left corner 
and (0, 0, 0,1) on the bottom right corner. 

We used a squared Euclidean metric (4.9) on the flattened hemisphere. Since the data 
is defined on an irregular graph, instead of (4.8), we use a graph-based discrete gra¬ 
dient. We denote {{h j))(i,j)eQ graph which connects neighboring electrodes. The 
gradient operator on the graph is 

Vp e M", Ap = (pi - Pj)(i,j)eG e 

The total variation on this graph is then obtained by using J = A|| • ||i, the norm, 
i.e. we use /3 = 1 in (4.8). 

Eigure 4.3 compares the naive P barycenters (i.e. the usual mean), barycenters 
obtained without regularization (i.e. A = 0) and barycenters computed with an 
increasing regularization strength A. The input histograms {pk)k being very noisy, 
the use of regularization is important to make the area of significant activity emerge 
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A = 0 A = 20 A = 40 A = 60 A = 80 A = 100 A = 200 A = 300 


Fig. 4.2: Influence of A parameter for the iso-barycenter (i.e. w = (1/2,1/2)) between 
two input densities (they are the upper-left and upper-right corner of the A = 0 case 
in Figure 4.1). Top row: isotropic total variation {/3 = 2). Bottom row: anisotropic 
total variation (^d = 1). 


Class 1 


Class 2 



Sample 1 Sample 2 Sample 3 Mean 


Sample 1 Sample 2 Sample 3 Mean 



A=0 A=2 A=4 A=8 


A=0 A=2 A=4 A=8 


Fig. 4.3: Barycenter computation on MEG data. The left/right panels shows respec¬ 
tively the first and the second class, corresponding to recordings where the subject is 
asked to push the left or the right button. Top row: examples of input histograms 
Qk for each class, as well as the P mean Bottom row: computed TV- 

regularized barycenter for different values of A (A = 0 corresponding to no regulariza¬ 
tion). 


from the noise. The use of a TV regularization helps to keep a sharp transition 
between active and non-active regions. 

4.6. Gradient Flow. Instead of computing barycenters, we now use our reg¬ 
ularization to define time-evolutions, which are defined through a so-called discrete 
gradient flow. 

Starting from an initial histogram po ^ we define iteratively 

Pk+i =' argmin iJp, (p) + r/(p). (4.10) 

pe^N 

This means that one seeks a new iterate at (discrete time) /c + 1 that is both close 
(according to the Wasserstein distance) to pk and minimizes the functional /. In the 
following, we consider the gradient flow of regularization functionals as considered 
before, i.e. that are of the form rf = J o A. Problem (4.10) is thus a special case 
of (4.1) with N = 1. 

Letting k Too, one can informally think of pk as a discretization of a time 
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evolution evaluated at time t = kr. This method is a general scheme presented 
in much detail in the monograph [3]. The use of an implicit time-stepping (4.10) 
allows one to define time evolutions to minimize functionals that are not necessarily 
smooth, and this is exactly the case of the total variation semi-norm (since J is not 
differentiable). The use of gradient flows in the context of the Wasserstein fidelity to 
the previous iterate has been introduced initially in the seminal paper [18]. When / is 
the entropy functional, this paper proves that the countinous flow defined by the limit 
k ^ -\-oo and r ^ 0 is a heat equation. Numerous theoretical papers have shown how 
to recover many existing non-linear PDE’s by considering the appropriate functional 
/, see for instance [25, 17]. 

The numerical method we consider in this article is the one introduced in [27], 
that makes use of the entropic smoothing of the Wasserstein distance. It is not the 
scope of the present paper to discuss the problem of approximating gradient flows 
and the underlying limit non-linear PDE’s, and we refer to [27] for an overview of the 
vast literature on this topic. A major bottleneck of the algorithm developed in [27] is 
that it uses a primal optimization scheme (Dykstra’s algorithm) that necessitates the 
computation of the proximal operator of / according to the Kulback-Leibler diver¬ 
gence. Only relatively simple functionals (basically separable functionals such as the 
entropy) can thus be treated by this approach. In contrast, our dual method can cope 
with a much larger set of functions, and in particular those of the form f = Jo A, i.e. 
obtained by pre-composition with a linear operator. 



Eigure 4.4 shows examples of gradient flows computed for the isotropic total 
variation f{p) = ||Vp||i as defined in (4.8). We use the discretization setup considered 
in Section 4.4. This is exactly the regularization flow considered by [12]. This paper 
defines formally the highly non-linear fourth order PDE corresponding to the limit 
flow. This is however not a “true” PDE since the initial TV functional is non-smooth, 
and derivatives should be understood in a weak sense, as limit of an implicit discrete 
time stepping. While the algorithm proposed in [12] uses the usual (unregularized) 
Wasserstein distance, the use of a regularized transport allows us to deal with problems 
of larger sizes, with a faster numerical scheme. The price to pay is an additional 
blurring introduced by the entropic smoothing, but this is acceptable for applications 
to denoising in imaging. Eigure 4.4 illustrates the behavior of this TV regularization 
flow, which has the tendency to group together clusters of mass, and performs some 
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kind of progressive “percolation” over the whole image. 

Conclusion. In this paper, we introduced a dual framework for the resolution 
of certain variational problems involving Wasserstein distances. The key contribu¬ 
tion is that the dual functional is smooth and that its gradient can be computed in 
closed form and involves only multiplications with a Gibbs kernel. We illustrate this 
approach with applications to several problems revolving around the idea of Wasser¬ 
stein barycenters. This method is particularly advantageous for the computation of 
regularized barycenters, since pre-composition by linear operator (such as discrete 
gradient on images or graphs) of functionals is simple to handle. Our numerical find¬ 
ings is that entropic smoothing is crucial to stabilize the computation of barycenters 
and to obtain fast numerical schemes. Further regularization using for instance a total 
variation is also beneficial, and can be used in the framework of gradient flows. 
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Appendix A. Legendre Transform with Respect to Two Histograms. 

Theorem 2.4 can be extended to study the Legendre transform of Wry{p,q) with 
respect to both arguments (p, q) instead of only p. Indeed, expression (2.5) shows 
that (p^q) ^ W^{p^q) is a convex function (as a maximum of linear forms), so that 
one can define V (p, h) G x 


{ 9 , h) = max {g, p) + {h, q) - W{p, q). 
P,qe^n 


The following proposition adapts to this setting. 

Proposition A.l. The function W* is at (g^h) G x and, writing 
K = e~^l^, a = and ICag = didig{a)K/3, we have that 


W;ig,h) = -^loga^Kl3, 



where 



Moreover, the gradient function {g,h) i-G- \/W*{g,h) is 2/7 Lipschitz. 
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Proof. One has that Wf {g, h) can be written 

max {g, p) + {h, q) - max {u, p) + {v, q) - v) 

U,v 

= max — max {u + p) + ('^ + q) — v) 

= max — max {u', p) + {v', q) — jS^ m{u' g,v' ^ h) 

p,q u',v' ’ 

= max -WM+giT+ihTiPyq) 

p,q 

= max — min (X, M — gl^ — Ih^) — ^E{X) 

p,q xeu(p,q) 

= - min (X, M -gt^ - th^) - -fE{X). 

xeT.^2 

One verifies that the last Eq. is equivalent to a classic maximal entropy problem 
which can be solved uniquely with a Gibbs distribution equal to X* given below, 

^ diag(Q;).h: diag(^) 

a^K[3 

Substituting this expression in the formula above for W^{g^ h) yields that 

W;{g,h) = -^\oga^Kp. 

Since the gradients with respect to g and h of W^{g,h) are X *1 and X*^l respec¬ 
tively, this results in the expression provided above. The Hessian follows from that 
result, and the Lipschitz continuity of the gradient can be obtained by showing that 
the Hessian’s trace can be upper-bounded by 2/7 by noticing that the trace of both 
A^{g,h) and A^{h^g) is upper-bounded by a'^Kp. □ 
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